ABSTRACT

The effects of social influence and homophily suggest that both
network structure and node attribute information should inform the
tasks of link prediction and node attribute inference. Recently,
Yin et al. [28] [29] proposed the Social-Attribute Network (SAN), an
attribute-augmented social network, to integrate network structure
and node attributes to perform both link prediction and attribute
inference. They focused on generalizing the random walk with restart
algorithm to the SAN framework and showed improved performance. In
this paper, we extend the SAN framework with several leading
supervised and unsupervised link prediction algorithms and
demonstrate performance improvement for each algorithm on both link
prediction and attribute inference. Moreover, we make the novel
observation that attribute inference can help inform link prediction,
i.e., link prediction accuracy is further improved by first inferring
missing attributes. We comprehensively evaluate these algorithms and
compare them with other existing algorithms using a novel,
large-scale Google+ dataset, which we make publicly available.

Categories and Subject Descriptors 

H.4 [Information Systems Applications]: Miscellaneous

General Terms 

Social Network 

Keywords 

Link prediction, Attribute inference, Social-Attribute Network

1. INTRODUCTION

Online social networks (e.g., Facebook, Google+) have become
increasingly important resources for interacting with
people, processing information and diffusing social influence.
Understanding and modeling the mechanisms by which these 
networks evolve are therefore fundamental issues and active 
areas of research. 

The classical link prediction problem [16] has attracted
particular interest. In this setting, we are given a snapshot 
of a social network at time t and aim to predict links (e.g., 
friendships) that will emerge in the network between t and 
a later time t' . Alternatively, we can imagine the setting 
in which some links existed at time t but are missing at 
t' . In online social networks, a change in privacy settings 
often leads to missing links, e.g., a user on Google+ might
decide to hide her family circle between time t and t'. The
missing link problem has important ramifications as missing
links can alter estimates of network-level statistics [11], and
the ability to infer these missing links raises serious privacy 
concerns for social networks. Since the same algorithms can 
be used to predict new links and missing links, we refer to 
these problems jointly as link prediction. 

Another problem of increasing interest revolves around 
node attributes [31]. Many real-world networks contain rich
categorical node attributes, e.g., users in Google+ have profiles
with attributes including employer, school, occupation
and places lived. In the attribute inference problem, we aim 
to populate attribute information for network nodes with 
missing or incomplete attribute data. This scenario often 
arises in practice when users in online social networks set 
their profiles to be publicly invisible or create an account 
without providing any attribute information. The growing 
interest in this problem is highlighted by the privacy im- 
plications associated with attribute inference as well as the 
importance of attribute information for applications includ- 
ing people search and collaborative filtering. 

In this work, we simultaneously use network structure 
and node attribute information to improve performance of 
both the link prediction and the attribute inference prob- 
lems, motivated by the observed interaction and homophily 
between network structure and node attributes. The prin- 
ciple of social influence [t], which states that users who are 
linked are likely to adopt similar attributes, suggests that 
network structure should inform attribute inference. Other 
evidence of interaction [13] [10] shows that users with similar
attributes, or in some cases antithetical attributes, are likely
to link to one another, motivating the use of attribute
information for link prediction. Additionally, previous studies
[12] [7] have empirically demonstrated those effects on real-world
social networks, providing further support for considering both
network structure and node attribute information when predicting
links or inferring attributes.

However, the algorithmic question of how to simultane- 
ously incorporate these two sources of information remains 
largely unanswered. The relational learning [26] [20] [30], matrix
factorization and alignment [19] [24] based approaches
have been proposed to leverage attribute information for 
link prediction, but they suffer from scalability issues. More 
recently, Backstrom and Leskovec [2] presented a Super- 
vised Random Walk (SRW) algorithm for link prediction 
that combines network structure and edge attribute infor- 
mation, but this approach does not fully leverage node at- 
tribute information as it only incorporates node information 
for neighboring nodes. For instance, SRW cannot take ad- 
vantage of the common node attribute San Francisco of U2 
and Its in Fig. 1 since there is no edge between them.

Yin et al. [29] [28] proposed the use of the Social-Attribute Network
(SAN) to gracefully integrate network structure and node attributes
in a scalable way. They focused on generalizing the Random Walk with
Restart (RWwR) algorithm to the SAN model to predict links as well as
infer node attributes. In this paper, we generalize several leading
supervised and unsupervised link prediction algorithms [9] to the SAN
model to both predict links and infer missing attributes. We evaluate
these algorithms on a novel, large-scale Google+ dataset, and
demonstrate performance improvement for each of them. Moreover, we
make the novel observation that inferring attributes could help
predict links, i.e., link prediction accuracy is further improved by
first inferring missing node attributes.

2. PROBLEM DEFINITION 

In our problem setting, we use an undirected graph $G = (V, E)$ to
represent a social network, where edges in $E$ represent interactions
between the $N = |V|$ nodes in $V$. In addition to network structure,
we have categorical attributes for nodes. For instance, in the
Google+ social network, nodes
are users, edges represent friendship (or some other relation- 
ship) between users, and node attributes are derived from 
user profile information and include fields such as employer, 
school, and hometown. In this work we restrict our focus to 
categorical variables, though in principle other types of variables,
e.g., live chats, email messages, real-valued variables,
etc., could be clustered into categorical variables via vector 
quantization, or directly discretized to categorical variables. 

We use a binary representation for each categorical at- 
tribute. For example, various employers (e.g., Google, In- 
tel and Yahoo) and various schools (e.g., Berkeley, Stanford 
and Yale) are each treated as separate binary attributes. 
Hence, for a specific social network, the number of distinct 
attributes $M$ is finite (though $M$ could be large). Attributes
of a node $u$ are then represented as an $M$-dimensional trinary
column vector $a_u$ with the $i$th entry equal to 1 when $u$ has the
$i$th attribute (positive attribute), $-1$ when $u$ does not have it
(negative attribute) and 0 when it is unknown whether or not $u$ has
it (missing attribute). We denote by $A = [a_1\ a_2\ \cdots\ a_N]$
the attribute matrix for all nodes. Note
that certain attributes (e.g. Female and Male, age of 20 and 
30) are mutually exclusive. Let L be the set of all pairs of 
mutually exclusive attributes. This set constrains the at- 
tribute matrix A so that no column contains a 1 for two 
mutually exclusive attributes. 
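
The trinary encoding and the mutual-exclusion constraint can be illustrated with a small sketch. This is only an illustration under stated assumptions: the attribute names, the helper names `attribute_vector` and `violates_mutex`, and the toy data are hypothetical and do not come from the paper.

```python
# Hypothetical toy vocabulary (M = 4 binary attributes) and mutex set L.
ATTRIBUTES = ["Google", "Intel", "Female", "Male"]
MUTEX_PAIRS = {("Female", "Male")}

def attribute_vector(positive, negative):
    """Trinary vector for one node: 1 = positive, -1 = negative, 0 = missing."""
    vec = []
    for name in ATTRIBUTES:
        if name in positive:
            vec.append(1)
        elif name in negative:
            vec.append(-1)
        else:
            vec.append(0)
    return vec

def violates_mutex(vec):
    """True if the vector has a 1 for both attributes of some mutex pair."""
    index = {name: i for i, name in enumerate(ATTRIBUTES)}
    return any(vec[index[a]] == 1 and vec[index[b]] == 1 for a, b in MUTEX_PAIRS)

# A user known to work at Google, known to be Female and not Male;
# whether she has ever worked at Intel is unknown (missing).
a_u = attribute_vector(positive={"Google", "Female"}, negative={"Male"})
```

Stacking one such vector per node as columns gives the attribute matrix A described above.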

Footnote 2: Our model and algorithms can also be generalized to
directed graphs.



We define the link prediction problem as follows:

Definition 1 (Link Prediction Problem). Let $T_i = (G_i, A_i, L_i)$
and $T_j = (G_j, A_j, L_j)$ be snapshots of a social network at times
$i$ and $j$. Then the link prediction problem involves using $T_i$ to
predict the social network structure $G_j$. When $i < j$, new links
are predicted. When $i > j$, missing links are predicted.

In this paper, we work with three snapshots of the Google+ network
crawled at three successive times, denoted $T_1 = (G_1, A_1, L_1)$,
$T_2 = (G_2, A_2, L_2)$ and $T_3 = (G_3, A_3, L_3)$. To predict new
links, we use various algorithms to solve the link prediction problem
with $i = 2$ and $j = 3$ and first learn any required hyperparameters
by performing grid search on the link prediction problem with $i = 1$
and $j = 2$. Similarly, to predict missing links, we solve the link
prediction problem with $i = 2$ and $j = 1$ and learn hyperparameters
via grid search with $i = 3$ and $j = 2$.

For any given snapshot, several entries of A will be zero, 
corresponding to missing attributes. The attribute infer- 
ence problem, which involves only a single snapshot of the 
network, is defined as follows: 

Definition 2 (Attribute Inference Problem). Let $T = (G, A, L)$ be a
snapshot of a social network. Then the attribute inference problem is
to infer whether each zero entry of $A$ corresponds to a positive or
negative attribute, subject to the constraints listed in $L$.

Our goal is to design scalable algorithms leveraging both 
network structure and rich node attributes to address these 
problems for real-world large-scale networks.

3. MODEL AND ALGORITHMS 
3.1 Social-Attribute Network Model

The Social-Attribute Network was first proposed by Yin et
al. [28] [29] to predict links and infer attributes. However,
their original model didn't consider negative and mutually 
exclusive attributes. In this section, we review this model 
and extend it to incorporate negative and mutex attributes. 

Given a social network G with M distinct categorical at- 
tributes, an attribute matrix A and mutex attributes set L, 
we create an augmented network by adding M additional 
nodes to G, with each additional node corresponding to an 
attribute. For each node u in G with positive or negative 
attribute a, we create an undirected link between u and a 
in the augmented network. For each mutually exclusive attribute
pair (a, b), we create an undirected link between a
and b. This augmented network is called the Social-Attribute 
Network (SAN) since it includes the original social network 
interactions, relations between nodes and their attributes 
and mutex links between attributes. 
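
The construction above can be sketched in a few lines. This is a minimal sketch, assuming a plain dictionary-of-dictionaries adjacency representation; the function name `build_san`, the `("s", name)` / `("a", name)` node keys and the link-type labels are illustrative choices, not part of the original model description.

```python
from collections import defaultdict

def build_san(social_edges, positive_attrs, negative_attrs, mutex_pairs):
    """Build a SAN as an adjacency map {node: {neighbor: link type}}.

    Social nodes are keyed ("s", name) and attribute nodes ("a", name) so the
    two namespaces cannot collide. Link types: "social", "pos", "neg", "mutex".
    """
    san = defaultdict(dict)

    def link(x, y, kind):
        # All SAN links are undirected, so store both directions.
        san[x][y] = kind
        san[y][x] = kind

    for u, v in social_edges:
        link(("s", u), ("s", v), "social")
    for u, attrs in positive_attrs.items():
        for a in attrs:
            link(("s", u), ("a", a), "pos")
    for u, attrs in negative_attrs.items():
        for a in attrs:
            link(("s", u), ("a", a), "neg")
    for a, b in mutex_pairs:
        link(("a", a), ("a", b), "mutex")
    return dict(san)

# Hypothetical toy input.
san = build_san(
    social_edges=[("u1", "u2"), ("u2", "u3")],
    positive_attrs={"u1": ["Google"], "u3": ["Google", "Male"]},
    negative_attrs={"u3": ["Female"]},
    mutex_pairs=[("Female", "Male")],
)
```

Missing attributes simply produce no link, which is what distinguishes them from negative attributes in the augmented network.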

Nodes in the SAN model corresponding to nodes in G are called social
nodes, and nodes representing attributes are called attribute nodes.
Links between social nodes are called social links, and links between
social nodes and attribute nodes are called attribute links.
Attribute link (u, a) is a positive attribute link if a is a positive
attribute of node u, and it is a negative attribute link otherwise.
Links between mutually exclusive attribute nodes are called mutex
links. Intuitively, the SAN model explicitly describes the sharing of
attributes across social nodes as well as the mutual exclusion
between attributes, as illustrated in the sample SAN model of Fig. 1.
Moreover, with the SAN model, the link prediction problem reduces to
predicting social links and the attribute inference problem involves
predicting attribute links.

Footnote 3: Note that they name this model Augmented Graph. We call
it Social-Attribute Network because this name is more meaningful.

Figure 1: Illustration of a Social-Attribute Network (SAN). The link
prediction problem reduces to predicting social links while the
attribute inference problem involves predicting attribute links.

We also place weights on the various nodes and edges 
in the SAN model. These node and edge weights describe 
the relative importance of individual nodes or relationships 
across nodes and can also be used in a global fashion to 
balance the influence of social nodes versus attribute nodes 
and social links versus attribute links. We use $w(u)$ and
$w(u, v)$ to denote the weight of node $u$ and the weight of
link $(u, v)$, respectively. Additionally, for a given social or
attribute node $u$ in the SAN model, we denote by $\Gamma_{+}(u)$ and
$\Gamma_{s+}(u)$ respectively the set of all neighbors and the set of
social neighbors connected to $u$ via social links or positive
attribute links. We define $\Gamma_{-}(u)$ and $\Gamma_{s-}(u)$ in a
similar fashion. This terminology will prove useful when we describe
our generalization of leading link prediction algorithms to 
the SAN model in the next section. 

The fact that no social node can be linked to multiple mu- 
tex attributes is encoded in the mutex property, i.e., there is 
no triangle consisting of a mutex link and two positive at- 
tribute links in any social-attribute network, which enforces 
a set of constraints for all attribute inference algorithms. 

In this work, we focus primarily on node attributes. How- 
ever, we note that the SAN model can be naturally extended 
to incorporate edge attributes. Indeed, we can use a function
(e.g., the logistic function) to map a given set of attributes
for each edge (e.g., edge age) into the real-valued
edge weights of the SAN model. The attributes-to-weight
mapping function can be learned using an approach similar
to the one proposed by Backstrom and Leskovec [2].

3.2 Algorithms 

Link prediction algorithms typically compute a probabilis- 
tic score for each candidate link and subsequently rank these 
scores and choose the largest ones (up to some threshold) 
as putative new or missing links. In the following, we ex- 
tend both unsupervised and supervised algorithms to the 
SAN model. Furthermore, we note that when predicting at- 
tribute links, the SAN model features a post-processing step 
whereby we change the lowest ranked putative positive links 
violating the mutex property to negative links. 
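
One possible reading of this post-processing step is sketched below. It is an assumption-laden illustration: the function name `enforce_mutex`, the per-node dictionaries, and the tie-breaking rule (on equal scores the second attribute of the pair is flipped) are hypothetical choices, since the text only states that the lowest ranked violating positive links become negative.

```python
def enforce_mutex(scores, predicted_positive, mutex_pairs):
    """Flip the lower-scored attribute of each violated mutex pair to negative.

    scores: {attribute: score} for one social node.
    predicted_positive: attributes predicted as positive for that node.
    Returns the (positive, negative) sets after post-processing.
    """
    positive = set(predicted_positive)
    negative = set()
    for a, b in mutex_pairs:
        if a in positive and b in positive:
            # Assumption: on a tie, the second attribute of the pair is flipped.
            loser = a if scores[a] < scores[b] else b
            positive.discard(loser)
            negative.add(loser)
    return positive, negative

# Hypothetical scores for a single user.
pos, neg = enforce_mutex(
    scores={"Female": 0.9, "Male": 0.4, "Google": 0.7},
    predicted_positive={"Female", "Male", "Google"},
    mutex_pairs=[("Female", "Male")],
)
```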



3.2.1 Unsupervised Link and Attribute Inference

Liben-Nowell and Kleinberg [16] provide a comprehensive
survey of unsupervised link prediction algorithms for
social networks. These algorithms can be roughly divided 
into two categories: local-neighborhood-based algorithms 
and global-structure-based algorithms. In principle, all of 
the algorithms discussed in [16] can be generalized for the 
SAN model. In this work we focus on representative algo- 
rithms from both categories and we describe below how to 
generalize them to the SAN model to predict both social 
links and attribute links. We add the suffix '-SAN' to each
algorithm name to indicate its generalization to the SAN
model. In our presentation of the unsupervised algorithms,
we only consider positive attribute links, though many of
these algorithms can be extended to signed networks [25].

Common Neighbor (CN-SAN) is a local algorithm that computes a score
for a candidate social or attribute link $(u, v)$ as the sum of
weights of $u$ and $v$'s common neighbors, i.e.,
$\mathrm{score}(u, v) = \sum_{t \in \Gamma_{+}(u) \cap \Gamma_{+}(v)} w(t)$.
Conventional CN only considers common social neighbors.
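
The CN-SAN score is simple enough to state directly in code. The following is a minimal sketch, assuming the SAN is stored as a dictionary mapping each node to its set of neighbors over social and positive attribute links; the names `cn_san`, `pos_neighbors` and the `a:` prefix for attribute nodes are illustrative.

```python
def cn_san(pos_neighbors, weight, u, v):
    """Sum the node weights w(t) over the common neighbors of u and v."""
    common = pos_neighbors[u] & pos_neighbors[v]
    return sum(weight[t] for t in common)

# Toy SAN: u1 and u2 share one friend (u3) and one attribute node (a:SF).
pos_neighbors = {
    "u1": {"u3", "a:SF"},
    "u2": {"u3", "a:SF", "a:Google"},
    "u3": {"u1", "u2"},
    "a:SF": {"u1", "u2"},
    "a:Google": {"u2"},
}
# All weights equal to one.
unit = {node: 1.0 for node in pos_neighbors}
# Attribute node weights set to zero, which recovers conventional CN.
social_only = {n: (0.0 if n.startswith("a:") else 1.0) for n in pos_neighbors}
```

With unit weights the shared attribute node raises the score of the candidate link (u1, u2) from 1 to 2, which is the intended effect of augmenting the network with attributes.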

Adamic-Adar (AA-SAN) is also a local algorithm. For
a candidate social link $(u, v)$ the AA-SAN score is

$$\mathrm{score}(u, v) = \sum_{t \in \Gamma_{+}(u) \cap \Gamma_{+}(v)} \frac{w(t)}{\log |\Gamma_{s+}(t)|}.$$

Conventional AA, initially proposed in [1] to predict friendships on
the web and subsequently adapted by [16] to predict links in social
networks, only considers common social neighbors. AA-SAN weights the
importance of a common neighbor proportional to the inverse of the
log of social degree. Intuitively, we want to downweight the
importance of neighbors that are either i) social nodes that are
social hubs or ii) attribute nodes corresponding to attributes that
are widespread across social nodes. Since in both cases this weight
depends on the social degree of a neighbor, the AA-SAN weight is
derived based on social degree, rather than total degree.

In contrast, for a candidate attribute link $(u, a)$, the attribute
degree of a common neighbor does influence the importance of the
neighbor. For instance, consider two social nodes with the same
social degree that are both common neighbors of nodes $u$ and $a$. If
the first of these social nodes has only two attribute neighbors
while the second has 1000 attribute neighbors, the importance of the
former social node should be greater with respect to the candidate
attribute link. Thus, AA-SAN computes the score for candidate
attribute link $(u, a)$ as

$$\mathrm{score}(u, a) = \sum_{t \in \Gamma_{s+}(u) \cap \Gamma_{s+}(a)} \frac{w(t)}{\log |\Gamma_{+}(t)|}.$$
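
The two AA-SAN scores differ only in which neighbor sets are intersected and which degree appears in the denominator. The sketch below illustrates this, assuming two precomputed dictionaries: `pos_nbrs` for all neighbors over social and positive attribute links, and `social_nbrs` for social neighbors only. These names, the function names and the toy graph are hypothetical.

```python
import math

def aa_san_social(pos_nbrs, social_nbrs, weight, u, v):
    """Candidate social link: each common neighbor t (social or attribute
    node) is down-weighted by the log of its social degree."""
    common = pos_nbrs[u] & pos_nbrs[v]
    return sum(weight[t] / math.log(len(social_nbrs[t])) for t in common)

def aa_san_attribute(pos_nbrs, social_nbrs, weight, u, a):
    """Candidate attribute link: each common social neighbor t is
    down-weighted by the log of its total degree."""
    common = social_nbrs[u] & social_nbrs[a]
    return sum(weight[t] / math.log(len(pos_nbrs[t])) for t in common)

# Toy SAN: social links u1-u3, u2-u3, u3-u4; u1 and u2 have attribute SF,
# u3 and u4 have attribute Google.
social_nbrs = {
    "u1": {"u3"}, "u2": {"u3"}, "u3": {"u1", "u2", "u4"}, "u4": {"u3"},
    "a:SF": {"u1", "u2"}, "a:Google": {"u3", "u4"},
}
pos_nbrs = {
    "u1": {"u3", "a:SF"}, "u2": {"u3", "a:SF"},
    "u3": {"u1", "u2", "u4", "a:Google"}, "u4": {"u3", "a:Google"},
    "a:SF": {"u1", "u2"}, "a:Google": {"u3", "u4"},
}
unit = {node: 1.0 for node in pos_nbrs}

social_score = aa_san_social(pos_nbrs, social_nbrs, unit, "u1", "u2")
attr_score = aa_san_attribute(pos_nbrs, social_nbrs, unit, "u1", "a:Google")
```

Every common neighbor of two distinct nodes has degree at least two in the relevant sense, so the logarithm in the denominator is always positive here.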



Low-rank Approximation (LRA-SAN) takes advantage 
of global structure, in contrast to CN-SAN and AA-SAN. 
Denote $X_s$ as the $N \times N$ weighted social adjacency matrix
where the $(u, v)$th entry of $X_s$ is $w(u, v)$ if $(u, v)$ is a
social link and zero otherwise. Similarly, let $X_a$ be the
$N \times M$ weighted attribute adjacency matrix where the $(u, a)$th
entry of $X_a$ is $w(u, a)$ if $(u, a)$ is a positive attribute link
and zero otherwise. We then obtain the weighted adjacency matrix $X$
for the SAN model by concatenating $X_s$ and $X_a$, i.e.,
$X = [X_s\ X_a]$. The LRA-SAN method assumes that a small number of
latent factors (approximately) describe the social and attribute link
strengths within $X$ and attempts to extract these factors via
low-rank approximation of $X$, denoted by $\hat{X}$. The LRA-SAN
score for a candidate social or attribute link $(u, t)$ is then
simply $\hat{X}_{ut}$, or the $(u, t)$th entry of $\hat{X}$. LRA-SAN
can be computed efficiently via truncated
Singular Value Decomposition (SVD).
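
A dense toy version of LRA-SAN can be written with NumPy as follows. This is a sketch under assumptions: it computes a full SVD and then keeps the leading singular triplets, which is only sensible for small matrices (a sparse truncated solver would be the practical choice at the scale discussed in this paper), and the function name `lra_san_scores` and the example matrices are hypothetical.

```python
import numpy as np

def lra_san_scores(X_s, X_a, rank):
    """Score all links by a rank-`rank` approximation of X = [X_s X_a]."""
    X = np.hstack([X_s, X_a])                        # N x (N + M)
    U, sing, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the `rank` largest singular values and their vectors.
    X_hat = (U[:, :rank] * sing[:rank]) @ Vt[:rank, :]
    n = X_s.shape[0]
    return X_hat[:, :n], X_hat[:, n:]                # social, attribute scores

# Toy SAN with N = 3 social nodes and M = 2 attribute nodes, unit weights.
X_s = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X_a = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
social_hat, attr_hat = lra_san_scores(X_s, X_a, rank=2)
```

Entry (u, t) of `social_hat` or `attr_hat` then plays the role of the LRA-SAN score for the corresponding candidate link.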

CN + Low-rank Approximation (CN+LRA-SAN) is a mixture of local and
global methods, as it first performs CN-SAN using a SAN model and
then performs low-rank approximation on the resulting score matrix.
After performing CN-SAN, let $S_s$ be the resulting $N \times N$
score matrix for all social node pairs and $S_a$ be the resulting
$N \times M$ score matrix for all social-attribute node pairs. By
virtue of the CN-SAN algorithm, note that $S_s$ includes attribute
information and $S_a$ includes social interactions. CN+LRA-SAN then
predicts social links by computing a low-rank approximation of $S_s$,
denoted $\hat{S}_s$, and each entry of $\hat{S}_s$ is the predicted
social link score. Similarly, $\hat{S}_a$ is a low-rank approximation
of $S_a$, and each entry of $\hat{S}_a$ is the predicted score for
the corresponding attribute link.

AA + Low-rank Approximation (AA+LRA-SAN) is identical to CN+LRA-SAN
but with the score matrices $S_s$ and $S_a$ generated via the AA-SAN
algorithm.



Random Walk with Restart (RWwR-SAN) [29] is a global algorithm. In
the SAN model, a Random Walk with Restart [4] [21] starting from $u$
recursively walks to one of its neighbors $t$ with probability
proportional to the link weight $w(u, t)$ and returns to $u$ with a
fixed restart probability $\alpha$. The probability $P_{u,v}$ is the
stationary probability of node $v$ in a random walk with restart
initiated at $u$. In general, $P_{u,v} \neq P_{v,u}$. For a candidate
social link $(u, v)$, we compute $P_{u,v}$ and $P_{v,u}$ and let
$\mathrm{score}(u, v) = (P_{u,v} + P_{v,u})/2$. Note that RWwR for
link prediction in previous work [16] computes these stationary
probabilities based only on the social network. For a candidate
attribute link $(u, a)$, RWwR-SAN only computes $P_{u,a}$, and
$P_{u,a}$ is taken as the score of $(u, a)$.
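
The stationary probabilities can be approximated by power iteration. The sketch below assumes a weighted adjacency dictionary over social and positive attribute links; the function names, the default restart probability of 0.15, the fixed iteration count and the handling of nodes without neighbors are illustrative assumptions rather than settings taken from the paper.

```python
def rwr(neighbors, start, restart=0.15, iters=200):
    """Approximate the stationary distribution of a walk restarting at `start`.

    neighbors: {node: {neighbor: link weight}}.
    """
    prob = {node: 0.0 for node in neighbors}
    prob[start] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in neighbors}
        for node, mass in prob.items():
            total = sum(neighbors[node].values())
            if total == 0:
                # Assumption: a node without neighbors returns all mass to start.
                nxt[start] += mass
                continue
            for nbr, w in neighbors[node].items():
                nxt[nbr] += (1.0 - restart) * mass * w / total
            nxt[start] += restart * mass
        prob = nxt
    return prob

def rwwr_san_social(neighbors, u, v, restart=0.15):
    """Symmetrized score (P_uv + P_vu) / 2 for a candidate social link."""
    return (rwr(neighbors, u, restart)[v] + rwr(neighbors, v, restart)[u]) / 2.0

# Toy SAN: u1 and u2 are not linked but share friend u3 and attribute a:SF.
toy = {
    "u1": {"u3": 1.0, "a:SF": 1.0},
    "u2": {"u3": 1.0, "a:SF": 1.0},
    "u3": {"u1": 1.0, "u2": 1.0},
    "a:SF": {"u1": 1.0, "u2": 1.0},
}
p_from_u1 = rwr(toy, "u1")
```

For a candidate attribute link (u, a), the score would simply be `rwr(neighbors, u)[a]`, without symmetrization.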

We finally note that for predicting social links, if we set the 
weights of all attribute nodes and all attribute links to zero 
and we set the weights of all social nodes and social links to 
one, then all the algorithms described above reduce to their
standard forms described in [16]. In other words, we recover
the link prediction algorithms on pure social networks. 

3.2.2 Supervised Link and Attribute Inference

Link prediction can be cast as a binary classification prob- 
lem, in which we first construct features for links, and then 
use a classifier such as SVMs or Logistic Regression. In con- 
trast to unsupervised attribute inference, negative attribute 
links are needed in supervised attribute inference. 



Footnote 4: An alternative method for combining CN-SAN and LRA-SAN
under the SAN model that was not explored in this work involves
defining $S = [S_s\ S_a]$, approximating $S$ with $\hat{S}$ and using
the $(u, t)$th entry of $\hat{S}$ as a score for link $(u, t)$.

Footnote 5: For LRA-SAN this implies that $X_a$ is an $N \times M$
matrix of zeros, so the truncated SVD of $X$ is equivalent to that of
$X_s$ except for $M$ zeros appended to the right singular vectors of
$X_s$.




Figure 2: The fraction of users as a function of the number
of node attributes in the Google+ social network.

Supervised Link Prediction (SLP-SAN). For each link in our training
set, we can extract a set of topological features F (e.g., CN, AA,
etc.) computed from pure social networks and the similar features
F_SAN computed from the corresponding social-attribute networks. We
explored 4 feature combinations: i) SLP-I uses only topological
features F computed from social networks; ii) SLP-II uses topological
features F as well as an aggregate feature, i.e., the number of
common attributes of the two endpoints of a link; iii) SLP-SAN-III
uses topological features F_SAN; and iv) SLP-SAN-VI uses topological
features F and F_SAN. SLP-SAN-III and SLP-SAN-VI contain the
substring 'SAN' because they use features extracted from the SAN
model. SLP-I and SLP-II are widely used in previous work
[ol] [17] [2].



Supervised Attribute Inference (SAI-SAN) Recall that 
attribute inference is transformed to attribute link predic- 
tion with the SAN model. We can extract a set of topolog- 
ical features for each positive and negative attribute link. 
Moreover, the positive attribute links are taken as positive 
examples while the negative attribute links are taken as neg- 
ative examples. Hence, we can train a binary classifier for 
attribute links and then apply it to infer the missing at- 
tribute links. 

3.2.3 Iterative Link and Attribute Inference

In many real-world networks, most node attributes are 
missing. Fig. 2 shows the fraction of users as a function of
the number of node attributes in the Google+ social network.
From this figure, we see that roughly 70% of users have no 
observed node attributes. Hence, we will also investigate an 
iterative variant of the SAN model. We first infer the top 
attributes for users without any observed attributes. We 
then update the SAN model to include these predicted at- 
tributes and perform link prediction on the updated SAN 
model. This process can be performed for several iterations. 

4. GOOGLE+ DATA 

Google launched its new social network service named 
Google+ in early July 2011. We crawled three snapshots of
the Google+ social network and their users' profiles on July
19, August 6 and September 19 in 2011. They are denoted 
as JUL, AUG and SEP, respectively. We then pre-processed 
the data before conducting link prediction and attribute in- 
ference experiments. 

Preprocessing Social Networks. In Google+, users divide their social
connections into circles, such as a family circle and a friends
circle. If user u is in v's circle, then there is a directed edge
(v, u) in the graph, and thus the Google+ dataset is a directed
social graph. We converted this dataset into an undirected graph by
only retaining edges (u, v) if both directed edges (u, v) and (v, u)
exist in the original graph. We chose to adopt this filtering step
for two reasons: (1) Bidirectional edges represent mutual friendships
and hence represent a stronger type of relationship that is more
likely to be useful when inferring users' attributes from their
friends' attributes. (2) We reduce the influence of spammers who add
people into their circles without those people adding them back.
Spammers introduce fictitious directional edges into the social graph
that adversely influence the performance of
link prediction algorithms.
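
The filtering step amounts to keeping only reciprocated edges. A minimal sketch, assuming the crawl is available as a list of directed (source, target) pairs; the function name `mutual_edges` and the toy edge list are hypothetical:

```python
def mutual_edges(directed_edges):
    """Keep an undirected edge {u, v} only if both (u, v) and (v, u) exist."""
    directed = set(directed_edges)
    return {
        frozenset((u, v))
        for u, v in directed
        if u != v and (v, u) in directed
    }

# "spam" adds "alice" to a circle, but alice never adds "spam" back.
crawl = [("alice", "bob"), ("bob", "alice"), ("alice", "carol"), ("spam", "alice")]
undirected = mutual_edges(crawl)
```

Representing each retained edge as a `frozenset` makes (u, v) and (v, u) collapse into a single undirected edge.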

Collecting Attribute Vocabulary. Google+ profiles include short
entries about users such as Occupation, Employment, Education, Places
Lived, and Gender, etc. We use Employment and Education to construct
a vocabulary of attributes in this paper. We treat each distinct
employer or school entity as a distinct attribute. Google+ has
predefined employer and school entities, although users can still
fill in their own defined entities. Due to users' changing privacy
settings, some profiles in JUL are not found in AUG and SEP, so we
use JUL to construct our attribute vocabulary. Specifically, from the
profiles in JUL, we list all attributes and compute frequency of
appearance for each attribute. Our attribute vocabulary is
constructed by keeping
attributes with frequency of at least 3.
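
The vocabulary construction can be sketched as a frequency count followed by a threshold. The sketch assumes that frequency means the number of profiles in which an attribute appears, which the text does not spell out; the function name `build_vocabulary` and the toy profiles are hypothetical.

```python
from collections import Counter

def build_vocabulary(profiles, min_count=3):
    """Return the attributes that appear in at least `min_count` profiles."""
    counts = Counter()
    for attrs in profiles.values():
        # Assumption: an attribute is counted once per profile.
        counts.update(set(attrs))
    return {attr for attr, n in counts.items() if n >= min_count}

# Hypothetical employer and school entries taken from one snapshot.
profiles = {
    "u1": ["Google", "Stanford"],
    "u2": ["Google", "Berkeley"],
    "u3": ["Google", "Stanford"],
    "u4": ["Stanford", "My Own Startup"],
}
vocabulary = build_vocabulary(profiles)
```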

Constructing Social-Attribute Networks. In order to demonstrate that
the SAN model leverages node attributes well, we derived
social-attribute networks in which each node has some positive
attributes from the above Google+ social networks and attribute
vocabulary. Specifically, for an attribute-frequency threshold $k$,
we chose the largest connected social network from JUL such that each
node has at least $k$ distinct positive attributes. We also found the
corresponding social networks consisting of these nodes in snapshots
AUG and SEP. Social-attribute networks were then constructed with the
chosen social networks and the attributes of the nodes. Specifically,
we chose $k \in \{2, 4\}$ to construct 6 social-attribute networks
whose statistics are shown in Table 1. Each social-attribute network
is named by concatenating the snapshot name and the
attribute-frequency threshold. For example, 'JUL4' is the
social-attribute network constructed using JUL and $k = 4$. These
names are indicated in the first column of the table.

In the crawled raw networks, some social links in JULi are missing in
AUGi and SEPi, where i = 2, 4. These links are missing due to one of
two events occurring between the JUL and AUG or SEP snapshots:
1) users block other users, or 2) users set (part of) their circles
to be publicly invisible, after which point they cannot be publicly
crawled. These missing links provide ground truth labels for our
experiments of predicting missing links. However, these missing links
can alter estimates of network-level statistics, and can have
unexpected influences on link prediction algorithms [11]. Moreover,
it is likely in practice that companies like Facebook and Google keep
records of these missing links, and so it is reasonable to add these
links back to AUGi and SEPi for our link prediction experiments. The
third column in Table 1 is the number of all social links after
filling the missing links into AUGi and SEPi. The second column
#soci links is used for experiments of predicting missing links, and
column #all soci links is used for the experiments of predicting new
links.

Table 1: Statistics of social-attribute networks.

From these two columns, the number of new links or missing links can
be easily computed. For example, if we use AUG2 as training data and
SEP2 as testing data for link prediction, the number of new links is
354572 - 339059 = 15513, which is computed with entries in column
#all soci links. If we use AUG2 as training data and JUL2 as testing
data in predicting missing links, the number of missing links is
339059 - 328761 = 10298, which is computed with corresponding entries
in column #soci links and #all soci links.

5. EXPERIMENTS 
5.1 Experimental Setup 

In our experiments, the main metric used is AUC, Area Under the
Receiver Operating Characteristic (ROC) Curve, which is widely used
in the machine learning and social network communities [5] [2]. AUC
is computed in the manner described in [5], in which both positive
and negative examples are required. In principle, we could use new
links or missing links as positive examples and all non-existing
links as negative examples. However, large-scale social networks tend
to be very sparse, e.g., the average degree is 4.17 in SEP2, and, as
a result, the number of non-existing links can be enormous, e.g.,
SEP2 has around $2.9 \times 10^{10}$ non-existing links. Hence,
computing AUC using all non-existing links in large-scale networks is
typically computationally infeasible. Moreover, the majority of new
links in typical online social networks close triangles [14] [2],
i.e., are hop-2 links. For instance, we find that 58% of the newly
added links in Google+ are hop-2 links. We thus evaluate our large
network experiments using hop-2 link data as in [2], i.e., new or
missing hop-2 links are treated as positive examples and
non-existing hop-2 links are treated as negative examples.
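
For readers unfamiliar with the metric, AUC can be computed from the scores of positive and negative examples as the fraction of positive-negative pairs that are ranked correctly. The sketch below uses this standard pairwise formulation with ties counted as one half; it is not necessarily the exact procedure of the cited reference, and the function name `auc` is illustrative.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive example
    receives the higher score; ties contribute one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Scores of two new hop-2 links versus two non-existing hop-2 links.
example_auc = auc([0.9, 0.3], [0.5, 0.1])
```

The quadratic loop is only meant to make the definition explicit; a rank-based computation gives the same value more efficiently on large candidate sets.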

In a social-attribute network, there are two categories of 
hop-2 links: 1) those with two endpoints sharing at least 
one common social node, and 2) those with two endpoints 
sharing only common attribute nodes. Local algorithms ap- 
plied to the original social network are unable to predict 
hop-2 links in the second category. Thus, we evaluate only 
with respect to hop-2 links in the first category, so as not to 
give unfair advantage to algorithms running on the social-attribute
network. To better understand whether the AUC performance computed on
hop-2 links can be generalized to performance on any-hop links, we
additionally compute
AUC using any-hop links on the smaller Google+ networks.
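
The first category of hop-2 links, which is the one used for evaluation, can be enumerated from the social adjacency lists alone. A minimal sketch, assuming the social network is a dictionary from each social node to its set of social neighbors; the function name `hop2_candidates` and the toy graph are hypothetical.

```python
from itertools import combinations

def hop2_candidates(social_adj):
    """Unlinked pairs of social nodes sharing at least one social neighbor."""
    pairs = set()
    for nbrs in social_adj.values():
        # Any two neighbors of the same node are at most two hops apart.
        for u, v in combinations(sorted(nbrs), 2):
            if v not in social_adj[u]:
                pairs.add((u, v))
    return pairs

# Star around u3: u1, u2 and u4 are pairwise unlinked hop-2 candidates.
social_adj = {
    "u1": {"u3"},
    "u2": {"u3"},
    "u3": {"u1", "u2", "u4"},
    "u4": {"u3"},
}
candidates = hop2_candidates(social_adj)
```

Candidates that appear as links in the test snapshot would then serve as positive examples and the remaining ones as negative examples.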

In general, different nodes and links can have different 
weights in social-attribute networks, representing their rel- 
ative importance in the network. In all of our experiments 
in this paper, we set all weights to be one and leave it for 
future work to learn weights. 

We compare our link prediction algorithms with Supervised Random Walk
(SRW) [2], which leverages edge attributes, by transforming node
attributes to edge attributes. Specifically, we compute the number of
common attributes of the two endpoints of each existing link. As in
[2], we also use the number of common neighbors as an edge attribute.
We adopt the Wilcoxon-Mann-Whitney (WMW) loss function and logistic
edge strength function in our implementations as recommended in [2].

Table 2: Results for predicting new links. (a) AUC of hop-2 new links
on the train-test pair AUG4-SEP4. (b) AUC of hop-2 new links on the
train-test pair AUG2-SEP2. (c) (d) AUC of any-hop new links on the
train-test pair AUG4-SEP4. The numbers in parentheses are standard
deviations.

                      (a)                     (b)                     (c)
Alg           w/o Attri  With Attri   w/o Attri  With Attri   w/o Attri  With Attri
Random        0.5000     0.5000       0.5000     0.5000       0.5000     0.5000
CN-SAN        0.6730     0.7315       0.6936     0.7508       0.7482     0.8298
AA-SAN        0.7109     0.7476       0.7638     0.7895       0.7483     0.8324
LRA-SAN       0.6003     0.6262       0.6410     0.6385       0.8075     0.8237
CN+LRA-SAN    0.6969     o.reri       0.5642     0.6373       0.7857     0.8651
AA+LRA-SAN    0.7118     0.7471       0.6032     0.6557       0.8193     0.8552
RWwR-SAN      0.6033     0.6143       0.6788     0.6912       0.9363     0.9548

(d)
Alg           AUC
SLP-I         0.9128 (0.0140)
SLP-II        0.9580 (0.0017)
SLP-SAN-III   0.9450 (0.0007)
SLP-SAN-VI    0.9706 (0.0004)
SRW           0.9383

We compare our attribute inference algorithms with two algorithms,
BASELINE and LINK, introduced by Zheleva and Getoor [31]. Using only
node attributes, BASELINE first computes a marginal attribute
distribution and then uses an attribute's probability as its score.
LINK trains a classifier for each attribute by flattening nodes as
the rows of the adjacency matrix of the social networks. Zheleva and
Getoor [31] found that LINK is the best algorithm when
group memberships are not available.

We use SVM as our classifier in all supervised algorithms.
For link prediction, we extract six topological features
(CN-SAN, AA-SAN, LRA-SAN, CN+LRA-SAN, AA+LRA-SAN and
RWwR-SAN) from both pure social networks and social-attribute
networks. Hence, SLP-I, SLP-II, SLP-SAN-III and SLP-SAN-VI use
6, 7, 6 and 12 features, respectively. For attribute inference,
we extract 9 topological features for each attribute link. We
adopt two ranks (detailed in Section 5.2.2) for each low-rank
approximation based algorithm, thus obtaining 6 features. The
other three features are CN-SAN, AA-SAN and RWwR-SAN. To
account for the highly imbalanced class distribution of
examples for supervised link prediction and attribute
inference, we downsample negative examples so that we have an
equal number of positive and negative examples (techniques
proposed in [17, 6] could be used to further improve the
performance).
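
The downsampling step can be illustrated with a minimal sketch. It assumes that examples are stored as (features, label) pairs with label 1 for positive and 0 for negative examples; the function name downsample_negatives and the toy data are hypothetical and are not taken from our implementation.

```python
import random


def downsample_negatives(examples, seed=0):
    """Balance a labeled example list by subsampling the negatives.

    `examples` is a list of (features, label) pairs, label 1 = positive,
    label 0 = negative. All positives are kept; negatives are sampled
    uniformly at random, without replacement, to match the positive count.
    """
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    balanced = positives + rng.sample(negatives, k)
    rng.shuffle(balanced)  # avoid an ordering that reveals the labels
    return balanced


# Toy data (hypothetical): 5 positive and 100 negative examples.
examples = [((i,), 1) for i in range(5)] + [((i,), 0) for i in range(5, 105)]
balanced = downsample_negatives(examples)
```

The balanced list can then be passed to any classifier; the fixed seed only makes the toy run repeatable.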

We use the pattern dataset1-dataset2 to denote a train-test or
train-validation pair, with dataset1 a training dataset and
dataset2 a testing or validation dataset. When conducting
experiments to predict new links on the AUGi-SEPi train-test
pair, SRW, the classifiers, and the hyperparameters of global
algorithms (i.e., the ranks in LRA-SAN, CN+LRA-SAN and
AA+LRA-SAN and the restart probability α in RWwR-SAN) are
learned on the JULi-AUGi train-validation pair. Similarly,
when predicting missing links on the train-test pair
AUGi-JULi, they are learned on the train-validation pair
SEPi-AUGi, where i = 2, 4.

The CN-SAN and AA-SAN algorithms are implemented in Python 2.7,
while the RWwR-SAN algorithm and Supervised Random Walk (SRW)
are implemented in Matlab; all of them are run on a desktop
with a 3.06 GHz Intel Core i3 and 4GB of main memory. The
LRA-SAN, CN+LRA-SAN and AA+LRA-SAN algorithms are implemented
in Matlab and run on an x86-64 architecture using a single
2.60 GHz core and 30GB of main memory.

5.2 Experimental Results 

In this section we present evaluations of the algorithms 
on the Google+ dataset. We first show that incorporat- 
ing attributes via the SAN model improves the performance 
of both unsupervised and supervised link prediction algo- 
rithms. Then we demonstrate that inferring attributes via 
link prediction algorithms within the SAN model achieves 
state-of-the-art performance. Finally, we show that by com- 
bining attribute inference and link prediction in an iterative 
fashion, we achieve even greater accuracy on the link pre- 
diction task. 




The original LINK algorithm [31] trained a distinct classifier
for each attribute type. In our setting, an attribute type
(e.g., Education) can have multiple values, so we train a
classifier for each binary attribute value.



Figure 3: ROC curves of the CN+LRA-SAN algorithm for
predicting new links. AUG4-SEP4 is the train-test pair.
JUL4-AUG4 is the train-validation pair.

5.2.1 Link Prediction

To demonstrate the benefits of combining node attributes and
network structure, we run the SAN-based link prediction
algorithms described in Section 3.2 both on the original
social networks and on the corresponding social-attribute
networks (recall that the SAN-based unsupervised algorithms
reduce to standard unsupervised link prediction algorithms
when working solely with the original social networks).

Figure 4: Performance of various algorithms on attribute
inference on SEP4. (a) AUC under ROC curves, (b) Pre@2,3,4.

Predicting New Links. Table 2 shows the AUC results of
predicting new links for each of our datasets. We are able to
draw a number of conclusions from these results. First, the
SAN model improves every unsupervised learning algorithm on
every dataset, save for LRA-SAN on AUG2-SEP2. Second, Table 2d
shows that attributes also improve supervised link prediction
performance, since SLP-SAN-VI, SLP-SAN-III and SLP-II
outperform SLP-I. Moreover, SLP-SAN-VI, which adopts features
extracted from both social networks and social-attribute
networks, achieves the best performance, thus demonstrating
the power of the SAN model. Third, comparing RWwR-SAN in
Table 2c and SRW in Table 2d, we observe that the SAN model is
better than SRW at leveraging node attributes, since RWwR-SAN
with attributes outperforms SRW. This result is not surprising
given that SRW is designed for edge attributes, and when
transforming node attributes to edge attributes, we lose some
information. For instance, as illustrated in Fig. 1, nodes u2
and u5 share the attribute San Francisco. When transforming
node attributes to edge attributes, this common attribute
information is lost since u2 and u5 are not linked.
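
The information loss can be made concrete with a small sketch. The toy network below is hypothetical (it is not the network of Fig. 1), links are treated as undirected for brevity, and all weights are one; the function names are illustrative only. The edge-attribute view stores common-attribute counts only on existing links, whereas a unit-weight common-neighbor count on the social-attribute network also sees attributes shared by unlinked users.

```python
# Hypothetical toy network: u2 and u5 share "San Francisco" but are not linked.
social_links = {("u1", "u2"), ("u1", "u3"), ("u3", "u5")}
attributes = {
    "u1": {"Google"},
    "u2": {"San Francisco", "Google"},
    "u3": set(),
    "u5": {"San Francisco"},
}


def edge_attribute_view(links, attrs):
    """Number of common attributes, defined only for existing links."""
    return {(u, v): len(attrs[u] & attrs[v]) for (u, v) in links}


def social_neighbors(node, links):
    """Neighbors of `node`, treating links as undirected (a simplification)."""
    return {b for a, b in links if a == node} | {a for a, b in links if b == node}


def san_common_neighbors(u, v, links, attrs):
    """Unit-weight common-neighbor count on the social-attribute network:
    shared social neighbors plus shared attribute nodes."""
    shared_friends = social_neighbors(u, links) & social_neighbors(v, links)
    shared_attrs = attrs[u] & attrs[v]
    return len(shared_friends) + len(shared_attrs)


edge_view = edge_attribute_view(social_links, attributes)
pair_on_edges = ("u2", "u5") in edge_view or ("u5", "u2") in edge_view
san_score = san_common_neighbors("u2", "u5", social_links, attributes)
```

In this toy case the pair (u2, u5) has no entry in the edge-attribute view, while its SAN-based score is nonzero because of the shared attribute node.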

Fig. 3 shows the ROC curves of the CN+LRA-SAN algorithm. We
see that the curve of CN+LRA-SAN with attributes dominates
that of CN+LRA-SAN without attributes, demonstrating the power
of the SAN model to effectively incorporate the additional
predictive information of attributes.

Predicting Missing Links. Missing links can be divided into
two categories: 1) links whose two endpoints have some social
links in the training dataset; 2) links with at least one
endpoint that has no social links in the training dataset.
Category 1 corresponds to the scenarios where users block
users or users set a part of their friend lists (e.g., family
circles) to be private. Category 2 corresponds to the scenario
in which users hide their entire friend lists. Note that all
hop-2 missing links belong to Category 1. In addition to
performing experiments to show that the SAN model improves
missing link prediction, we also perform experiments to
explore which category of missing links is easier to predict.
Table 3 shows the results of predicting missing links on
various datasets. As in the new-link prediction setting, the
performance of every algorithm is improved by the SAN model,
except for LRA-SAN on AUG4-JUL4 and RWwR-SAN on AUG4-JUL4 for
hop-2 missing links.

When comparing Tables 3d and 3e, or Tables 3c and 3f, we
conclude that the missing links in Category 2 are harder to
predict than those in Category 1. RWwR-SAN without attributes
performs poorly when predicting any-hop missing links in both
categories (as indicated by the entry with 0.2000 in Table
3d). This poor performance is due to the fact that RWwR-SAN
without attributes assigns zero scores to all the missing
links in Category 2 (positive examples) and positive scores to
most non-existing links (negative examples), making many
negative examples rank higher than positive examples and
resulting in a very low AUC.
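
A small numerical sketch shows why such a scoring pattern drives AUC below 0.5. It uses the standard pairwise definition of AUC (the probability that a random positive example is scored above a random negative one, with ties counted as one half); the scores are made up for illustration and do not come from our experiments.

```python
def auc(pos_scores, neg_scores):
    """Pairwise AUC: fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: every positive example receives a zero score,
# while most negative examples receive a positive score.
pos = [0.0, 0.0, 0.0, 0.0]
neg = [0.3, 0.1, 0.2, 0.0, 0.4]
low_auc = auc(pos, neg)
```

Here each positive example loses to four negatives and ties with one, so the resulting AUC is far below the 0.5 of a random predictor.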

5.2.2 Attribute Inference 

In this section, we focus on inferring attributes using the
SAN model. In our next set of experiments in Section 5.2.3,
we use the results of these attribute inference algorithms to
further improve link prediction, and the results of this
iterative approach further validate the performance of the SAN
model for attribute inference. Since the first step of the
iterative approach of Section 5.2.3 involves inferring the top
attributes for each node, we employ an additional performance
metric called Pre@K in our attribute inference experiments.
Compared to AUC, Pre@K better captures the quality of the top
attribute predictions for each user. Specifically, for each
sampled user, the top-K predicted attributes are selected, and
(unnormalized) Pre@K is then defined as the number of positive
attributes selected divided by the number of sampled users. We
address score ties in the manner described in [18]. Since most
Google+ users have a small number of attributes, we set
K = 2, 3, 4 in our experiments.
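
The definition of (unnormalized) Pre@K can be sketched as follows. The data layout (a per-user dictionary of attribute scores) and the name pre_at_k are assumptions made for illustration, and ties are broken by attribute name here, which is a simplification of the tie handling used in our experiments.

```python
def pre_at_k(scores, positives, k):
    """Unnormalized Pre@K.

    scores:    {user: {attribute: score}} for the sampled users
    positives: {user: set of true (positive) attributes}
    Returns the number of positive attributes among each user's top-k
    predictions, summed over users and divided by the number of users.
    """
    hits = 0
    for user, attr_scores in scores.items():
        # Descending score; ties broken by attribute name (simplification).
        ranked = sorted(attr_scores, key=lambda a: (-attr_scores[a], a))
        hits += sum(1 for a in ranked[:k] if a in positives[user])
    return hits / len(scores)


# Hypothetical scores for two sampled users.
scores = {
    "u1": {"Google": 0.9, "MIT": 0.4, "Berkeley": 0.7},
    "u2": {"Google": 0.2, "MIT": 0.8, "Berkeley": 0.5},
}
positives = {"u1": {"Google", "MIT"}, "u2": {"Berkeley"}}
pre1 = pre_at_k(scores, positives, 1)
pre2 = pre_at_k(scores, positives, 2)
pre3 = pre_at_k(scores, positives, 3)
```

Because the metric is not normalized by K, its value can exceed one when users have several positive attributes among their top-K predictions.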

When evaluating algorithms for the inference of missing 
attributes, we require ground truth data. In general, ground 
truth for node attributes is difficult to obtain since it is often 
not possible to distinguish between negative and missing at- 
tributes. However, for most users the number of attributes 
is quite small, and so we assume that users with many posi- 
tive attributes have no missing attributes. Hence, we evalu- 
ate attribute inference on users that have at least 4 specified 
attributes, i.e., we work with users in SEP4 and assume that 
each attribute link in SEP4 is either positive or negative. 

In our experiment, we sample 10% of the users in SEP4 
uniformly at random, remove their attribute links from SEP4, 
and evaluate the accuracy with which we can infer these 
users' attributes. All removed positive attribute links are 
viewed as positive examples, while all the negative attribute 
links of the sampled users are treated as negative examples. 
We run a variety of algorithms for attribute inference, and 
for each algorithm we average the results over 10 random 
trials. As noted above, we evaluate the performance of
attribute inference using both AUC and Pre@K.

For the low-rank approximation based algorithms, i.e.,
LRA-SAN, CN+LRA-SAN and AA+LRA-SAN, we report results using
two different ranks, 100 and 1000, and indicate which was used
by the number following the algorithm name in Fig. 4. We
choose these two small ranks for computational reasons and
also based on the fact that low-rank approximation methods
assume that a small number of latent factors (approximately)
describe the social-attribute networks. For RWwR-SAN, we set
the restart probability α to be 0.7.
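
To illustrate the role of the restart probability, the following sketch runs a generic random walk with restart by power iteration on a tiny, hypothetical social-attribute network with unit weights. It is not the RWwR-SAN transition rule itself, which is defined earlier in the paper; the uniform transition probabilities, the node names and the fixed iteration count are assumptions.

```python
def random_walk_with_restart(neighbors, source, alpha=0.7, iterations=100):
    """Power iteration for p = (1 - alpha) * P^T p + alpha * e_source.

    neighbors: {node: list of neighbor nodes}; every node needs at least
    one neighbor, and every neighbor must itself be a key.
    alpha:     restart probability (the walk jumps back to `source`).
    """
    p = {node: 0.0 for node in neighbors}
    p[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in neighbors}
        for node, mass in p.items():
            # Spread the non-restarting mass uniformly over the neighbors.
            share = (1.0 - alpha) * mass / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share
        nxt[source] += alpha  # restart mass returns to the source
        p = nxt
    return p


# Hypothetical network: three users and one attribute node "a:SF".
neighbors = {
    "u1": ["u2", "a:SF"],
    "u2": ["u1", "u3"],
    "u3": ["u2", "a:SF"],
    "a:SF": ["u1", "u3"],
}
rwr_scores = random_walk_with_restart(neighbors, "u1", alpha=0.7)
```

The stationary score of a candidate node (a user for link prediction, an attribute node for attribute inference) can then be read off from the returned dictionary.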

Fig. 4 shows the attribute inference results for various
algorithms. Several interesting observations can be made from
this figure. First, under both metrics, all SAN-based
algorithms perform better than BASELINE, save LRA100-SAN and
LRA1000-SAN under the Pre@2,3,4 metric, which indicates that
the SAN model is good at leveraging network structure to infer
missing attributes. Second, we find that AUC and Pre@K provide
inconsistent conclusions about relative algorithm performance.
For instance, the mean AUC values suggest that SAI-SAN beats
all other algorithms. However, several unsupervised algorithms
outperform SAI-SAN with respect to Pre@2,3,4. The
inconsistencies between the two metrics are expected since AUC
is a global measurement while Pre@K is a local one. Our
SAI-SAN algorithm dominates LINK under both AUC and Pre@2,3,4
metrics, thus demonstrating the power of mapping attribute
inference to


We find that RWwR-SAN performs consistently across different 
restart probabilities (results omitted due to space constraints). 



Table 3: Results for predicting missing links. (a) AUC of hop-2
missing links on the train-test pair AUG4-JUL4. (b) AUC of
hop-2 missing links on the train-test pair AUG2-JUL2. (c)-(f)
AUC of any-hop missing links on the train-test pair AUG4-JUL4.
Missing links in both categories 1 and 2 are used in (c) and
(d). Missing links in Category 1 are used in (e) and (f). The
numbers in parentheses are standard deviations.

(a)
Alg           w/o Attri   With Attri
Random        0.5000      0.5000
CN-SAN        0.7180      0.7925
AA-SAN        0.7437      0.7697
LRA-SAN       0.6569      0.6237
CN+LRA-SAN    0.7147      0.7986
AA+LRA-SAN    0.7410      0.7668
RWwR-SAN      0.5731      0.5676

(b)
Alg           w/o Attri   With Attri
Random        0.5000      0.5000
CN-SAN        0.6938      0.7309
AA-SAN        0.7633      0.7796
LRA-SAN       0.6044      0.6059
CN+LRA-SAN    0.5816      0.6266
AA+LRA-SAN    0.6212      0.6569
RWwR-SAN      0.6595      0.6706

(c)
Alg           AUC
SLP-I         0.5453(0.0120)
SLP-II        0.6991(0.0065)
SLP-SAN-III   0.7161(0.0030)
SLP-SAN-VI    0.8481(0.0022)

(d)
Alg           w/o Attri   With Attri
Random        0.5000      0.5000
CN-SAN        0.5460      0.7012
AA-SAN        0.5460      0.7033
LRA-SAN       0.5495      0.6177
CN+LRA-SAN    0.5547      0.7048
AA+LRA-SAN    0.5640      0.7325
RWwR-SAN      0.2000      0.7619

(e)
Alg           w/o Attri   With Attri
Random        0.5000      0.5000
CN-SAN        0.7329      0.7765
AA-SAN        0.7330      0.7784
LRA-SAN       0.7316      0.7401
CN+LRA-SAN    0.7515      0.7510
AA+LRA-SAN    0.8104      0.8116
RWwR-SAN      0.7797      0.8S38

(f)
Alg           AUC
SLP-I         0.8023(0.0088)
SLP-II        0.8403(0.0033)
SLP-SAN-III   0.8620(0.0080)
SLP-SAN-VI    0.8854(0.0324)


link prediction with the SAN model. 

Table 4: Results for iteratively inferring attributes and
predicting links. (a) on the AUG4-SEP4 train-test pair, (b) on
the AUG4-JUL4 train-test pair. Results are averaged over 10
trials. The numbers in parentheses are standard deviations.

(a)
Alg           w/o Attri   With Attri       With Inferred Attri
Random        0.5000(0)   0.5000(0)        0.5000(0)
CN-SAN        0.6730(0)   0.7174(0.0077)   0.7291(0.0063)
AA-SAN        0.7109(0)   0.7408(0.0063)   0.7440(0.0026)
LRA-SAN       0.6003(0)   0.6274(0.0052)   0.6320(0.0055)
CN+LRA-SAN    0.6969(0)   0.7497(0.0134)   0.7534(0.0084)
AA+LRA-SAN    0.7111(0)   0.7373(0.0050)   0.7442(0.0032)

(b)
Alg           w/o Attri   With Attri       With Inferred Attri
Random        0.5000(0)   0.5000(0)        0.5000(0)
CN-SAN        0.7180(0)   0.7780(0.0173)   0.7856(0.0100)
AA-SAN        0.7437(0)   0.7626(0.0100)   0.7661(0.0045)
LRA-SAN       0.6569(0)   0.6189(0.0105)   0.6134(0.0157)
CN+LRA-SAN    0.7147(0)   0.7838(0.0256)   0.7969(0.0059)
AA+LRA-SAN    0.7410(0)   0.7591(0.0118)   0.7673(0.0051)


5.2.3 Iterative Attribute and Link Inference

Section 5.2.1 demonstrated that knowledge of a user's
attributes can lead to significant improvements in link
prediction. However, in real-world social networks like
Google+, the vast majority of user attributes are missing (see
Fig. 2). To increase the realized benefits of social-attribute
networks with few attributes, we propose first inferring
missing attributes for each user whose attributes are missing
and then performing link prediction on the inferred
social-attribute networks. Recall that SAI-SAN achieves the
best AUC, RWwR-SAN achieves the best Pre@K in inferring
attributes (see Fig. 4), and AA-SAN achieves comparable Pre@K
results while being more scalable. Thus, in the following
experiments, we use AA-SAN to first infer the top-K missing
attributes for users, and subsequently perform link prediction
using various methods.

In our experiments, when we are working on the pair
train-test, we sample 10% of the users of train uniformly at
random and remove their attributes. We then run three variants
of link prediction algorithms: i) without attributes, ii) with
only the remaining attributes, and iii) with the remaining
attributes along with the inferred attributes. The top-4
attributes are inferred for each sampled user by AA-SAN. We
report the results averaged over 10 trials. The
hyperparameters of the global algorithms are the same as those
in Section 5.2.1, which are learned from the corresponding
train-validation pair.
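
The experimental protocol above can be sketched in a few lines. This is only an illustration under stated assumptions: the network is a hypothetical ring of ten users, and the attribute scorer is a simple count of friends holding each attribute, used here as a stand-in for AA-SAN rather than a reimplementation of it. The names infer_top_k and hide_and_infer are hypothetical.

```python
import random


def infer_top_k(user, friends, attrs, k):
    """Stand-in attribute scorer (not AA-SAN): score each attribute by the
    number of the user's friends that have it, and return the top k."""
    counts = {}
    for friend in friends[user]:
        for a in attrs[friend]:
            counts[a] = counts.get(a, 0) + 1
    ranked = sorted(counts, key=lambda a: (-counts[a], a))
    return set(ranked[:k])


def hide_and_infer(friends, attrs, fraction=0.1, k=4, seed=0):
    """Remove the attributes of a random fraction of users, then infer the
    top-k attributes for each of them from the remaining attributes."""
    rng = random.Random(seed)
    users = sorted(attrs)
    n_hidden = max(1, round(fraction * len(users)))
    hidden = set(rng.sample(users, n_hidden))
    remaining = {u: (set() if u in hidden else set(a)) for u, a in attrs.items()}
    inferred = {u: set(a) for u, a in remaining.items()}
    for u in hidden:
        inferred[u] = infer_top_k(u, friends, remaining, k)
    return hidden, remaining, inferred


# Hypothetical toy network: a ring of ten users with homophilous attributes.
users = ["u%d" % i for i in range(10)]
friends = {u: [users[(i - 1) % 10], users[(i + 1) % 10]] for i, u in enumerate(users)}
attrs = {u: {"C", "A" if i < 5 else "B"} for i, u in enumerate(users)}
hidden, remaining, inferred = hide_and_infer(friends, attrs)
```

Variant i) corresponds to running a link prediction algorithm with no attribute links at all, variant ii) to running it with `remaining`, and variant iii) to running it with `inferred`.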

Table 4a shows the results of first inferring attributes and
then predicting new links on the AUG4-SEP4 train-test pair.
Table 4b shows the results of first inferring attributes and
then predicting missing links on the AUG4-JUL4 train-test
pair. We see that the inferred attributes improve the
performance of all algorithms except LRA-SAN on predicting
missing links, which is unable to make use of attributes as
demonstrated earlier in Table 3a. The AUCs obtained with
inferred attributes for all other algorithms are very close to
those obtained with all positive attributes as shown in Table
2a. This further demonstrates that AA-SAN is an effective
algorithm for attribute inference.

6. RELATED WORK 

A wide range of link prediction methods have been developed.
Liben-Nowell and Kleinberg [16] surveyed a set of unsupervised
link prediction algorithms. Li [15] proposed Maximal Entropy
Random Walk (MERW). Lichtenwalter et al. [17] proposed the
PropFlow algorithm, which is similar to RWwR but more
localized. However, none of these approaches leverage node
attribute information.

Link prediction methods leveraging attribute information
first appeared in the relational learning community
[26, 20, 3, 30]. However, these approaches suffer from
scalability issues.
For instance, the largest network tested in [26] has about 2K
nodes. Recently, Backstrom and Leskovec [2] proposed the
Supervised Random Walk (SRW) algorithm to leverage edge
attributes. However, SRW does not handle the scenario in which
two nodes share common attributes (e.g., nodes u2 and u5 in
Fig. 1) but no edge already exists between them. Mapping link
prediction to a classification problem [9, 17, 6] is another
way to incorporate attributes. We have shown that classifiers
using features extracted from the SAN model perform very well.
Yang et al. [27] proposed to jointly predict links and
propagate node interests (e.g., music interest). Their
algorithm relies on the assumption that each node interest has
a set of explicit attributes. As a result, their algorithm
cannot be applied to our scenario, in which it is hard (if at
all possible) to extract explicit attributes for our node
attributes.



Previous works in [22, 23] aim at inferring node attributes
(e.g., ethnicity and political orientation) using supervised
learning methods with features extracted from user names and
user-generated texts. Zheleva and Getoor [31] map attribute
inference to a relational classification problem. They find
that methods using group information achieve good results.
These approaches are complementary to ours since they use
additional information apart from network structure and node
attributes. In this paper, we transform the attribute
inference problem into a link prediction problem with the SAN
model. Therefore, any link prediction algorithm can be used to
infer missing attributes. More importantly, we demonstrate
that attribute inference can in turn help link prediction with
the SAN model.

7. CONCLUSION AND FUTURE WORK 

We comprehensively evaluate the Social-Attribute Network (SAN)
model proposed in [28, 29] in terms of link prediction and
attribute inference. More specifically, we adapt several
representative unsupervised and supervised link prediction
algorithms to the SAN model to both predict links and infer
attributes. Our evaluation with a large-scale novel Google+
network dataset demonstrates performance improvement for each
of these generalized algorithms on both link prediction and
attribute inference. Moreover, we demonstrate a further
improvement of link prediction accuracy by using the SAN model
in an iterative fashion, first to infer missing attributes and
subsequently to predict links. Interesting avenues for future
research include devising an iterative algorithm that
alternates between attribute and link prediction, learning
node and edge weights in the SAN model, and incorporating edge
attributes, negative node attributes and mutex edges into
large-scale experiments.